ABSTRACT


SYSC is a compiler for a linear systolic array. SASP is one such
linear systolic array under development at IIT Kanpur. A
high-level language called S2 is chosen for programming SASP
instead of the assembly language used hitherto. SYSC translates
programs written in S2 to assembly code of SASP. This thesis
discusses the design and implementation of the back end of SYSC.
1. INTRODUCTION


The different generations in computers have been marked by 
significant leaps in the speed of computation. The earlier 
generations were characterised by sequential and uniprocessor 
machines. The ever increasing demand for faster computers could 
not be matched by comparable advances in electrical technology. 
Parallel computing was an attractive solution to overcome this 
barrier. 

Several approaches have been taken to build parallel 
architectures, the choice often guided by the application on hand. 
Array processing is one such technique, which has gained 
tremendous recognition in modern signal and image processing
applications. A systolic array is a practical realisation of this 
technique. 

The basic configuration of a primitive systolic model is 
schematically shown in Figure 1-1. This model has a simple
synchronous protocol. With each beat of the clock, every cell 
receives data from its neighbours, computes with the data and 
outputs the results which are then available as inputs to its 
neighbours at the next clock. 

Figure 1-1 Basic configuration of a Systolic Array

Most systolic architectures that were built in the early days
of systolic computing were highly application specific and were
found wanting in generality, in part due to the absence of proper
compiler support. In 1985, H T Kung and his group at Carnegie
Mellon University built a high-performance, programmable, general
purpose systolic array called the Warp machine [Anna]. This was
supported by a language called W2 with an optimising compiler. A
large variety of systolic algorithms have been implemented on the
Warp machine using this compiler [Lam].

Closer home, at IIT Kanpur, a Systolic Array Signal Processor
(SASP) machine is being developed [Usmani]. A simulator with a
meta-assembler is already available [Wand] and a large number of
algorithms have been implemented on this simulator [Sharma]. All
these algorithms were handcoded using the micro-instructions of
SASP. A mastery of the huge instruction set and a thorough
knowledge of even the lower level details of the SASP machine were
imperative for the programmer to code these algorithms. Assembly
language programming is best avoided; handcoding in systolic
machines is all the more formidable, since excessive importance is
given to cell-level details. This sidetracks the more important
issue of efficiently mapping the algorithm to the systolic
architecture. Hence, we preferred S2, a modified version of W2, as
the high-level programming language for SASP. The design and
implementation of the back end of a compiler for S2 were the
objectives of this thesis. This compiler, called SYSC, takes a
program in S2 and produces the microcode for the simulator.

This dissertation is organised as follows : 

In Chapter 2 we present the description of the SASP
machine.

Chapter 3 describes the machine abstraction given to the 
user and the programming language S2. 

The SYSC compiler structure is outlined in Chapter 4. 

The code generator is explained in Chapter 5.

Chapter 6 describes the implementation issues of SYSC. 

Finally, we present the conclusions and list a few 
suggestions for future work in Chapter 7. 



2. ARCHITECTURE OF SASP


SASP is a linear systolic array of computational cells. The
SASP architecture is similar to the Warp architecture. This chapter
presents the architecture of SASP in detail. For a compiler
writer, a good understanding of the architecture is essential. As
far as the user is concerned, the cell-level details are, in a
sense, irrelevant and he need only be conversant with a broad
overview of SASP. The machine abstraction which hides the
cell-level details is dealt with in Chapter 3.


2.1 Overview

There are three major components in the system - the SASP
array, the interface unit (IFU) and the host [Usmani], as depicted
in Figure 2-1.

The SASP array performs computation-intensive routines such
as low-level vision routines or matrix operations. It is a
one-dimensional systolic array with identical cells called SASP
cells. Data flows through the array on two data paths (X and Y),
while addresses (for local cell memories) travel on the Addr path.
Control signals from the neighbouring cells are received through
the Cntrl channel.

The IFU transfers data between the array and the host and 
regulates the flow of intermediate results through the array. It 
also generates addresses for local cell memories and systolic
control signals which are used in the computation of the cells. 


Figure 2-1 SASP System Overview

The host supplies data to IFU and SASP cells and receives the 
results from the IFU, in addition to executing those parts of the 
application program that are not mapped onto the SASP array. For
example, I/O is handled by the host. 

Before the start of execution by the array, the host dumps
the data onto the RAMs of the IFU and onto the data RAMs of each
cell over the broadcast bus (BC-bus). In the same way, the object
code is dumped onto the microcode RAMs of the IFU and the cells.
After this initial setup, the IFU takes over control of all the
blocks. It supervises routing and exchange of data during the
execution of the program. After execution is over, it generates
an interrupt, asking the host to take necessary action. The
host may then read the results and process them
further. This is how a program is typically executed on the SASP
system.

2.2 Intercell communication 

In the SASP machine, global communication is only through the
BC-bus. However, during execution, only local communication
channels are used.

The cells use asynchronous communication primitives for 
communicating with each other - cells send and receive data to and 
from their neighbours through dedicated buffers. A receive 
operation retrieves a data item from the specified channel and 
stores it in the variable desired. Similarly, a send operation 
sends the value of the specified variable on the desired channel. 
A queue is associated with each channel (XQ, YQ and AddrQ) and is
placed in the data path of the input cell. 

The semantics of asynchronous communication have been
supported in SASP through dynamic flow of control. When a cell
tries to read from an empty queue, it is blocked (i.e., machine
cycles are skipped) until the data item arrives. Similarly, when
a cell tries to write into a full queue, it is blocked until a
data item is removed from the full queue. Only the cell that
tries to read from an empty queue or deposit a data item into a
full queue is blocked.
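
The blocking rule can be illustrated with a small cycle-by-cycle model. The sketch below is in Python and is not part of SASP or its simulator; the class name, the queue capacity and the driver function are assumptions made only for illustration. A producer stalls when the queue is full and a consumer stalls when it is empty, while the other side keeps running.

```python
from collections import deque


class ChannelQueue:
    """Bounded FIFO between two neighbouring cells (capacity is an assumed value)."""

    def __init__(self, capacity=2):
        self.items = deque()
        self.capacity = capacity

    def try_send(self, value):
        """Return True if the value was queued, False if the sender must stall."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(value)
        return True

    def try_receive(self):
        """Return (True, value) on success, (False, None) if the receiver must stall."""
        if not self.items:
            return False, None
        return True, self.items.popleft()


def run(values, capacity=2, consumer_start=3):
    """Producer offers one value per cycle; consumer begins receiving at a later cycle."""
    queue = ChannelQueue(capacity)
    pending = list(values)
    received = []
    producer_stalls = 0
    consumer_stalls = 0
    cycle = 0
    while len(received) < len(values):
        if cycle >= consumer_start:
            ok, value = queue.try_receive()
            if ok:
                received.append(value)
            else:
                consumer_stalls += 1      # blocked on an empty queue
        if pending:
            if queue.try_send(pending[0]):
                pending.pop(0)
            else:
                producer_stalls += 1      # blocked on a full queue
        cycle += 1
    return received, producer_stalls, consumer_stalls
```

Data order is preserved in both cases; only the blocked side loses cycles.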
2.2.1 Communication channels

The channels are as described below : 

X channel. 

It is a 32-bit wide data path and is unidirectional. It 
starts from the IFU and ends on the last cell. The data on which 
computation is to be done is transmitted over this channel and it 
ripples through the cells without being modified. 

Y channel. 

It is also a 32-bit wide, bidirectional channel and its 
direction can be statically reconfigured by the microcode. This 
channel forms a closed loop starting from IFU, running through the 
processor array and ending at the IFU. Intermediate or final 
results travel on this channel. 

Addr channel. 

It provides addresses for local data memories in the cells. 
Data independent addresses are generated in the IFU and 
transferred along with data on the X channel. For example, when
multiplying two matrices, each cell is responsible for computing a 
column of the result. All cells access the same memory location 
which has been loaded with different columns of one of the 
argument matrices. Therefore, common addresses and loop 
controlled signals can be generated in the IFU and propagated to 
all the cells. 

Cntrl channel.

It contains control signals to read from or write into queues 
and the status of the queues of the neighbouring cells. 



The SASP array is integrated into a general purpose host 
(PC/XT). The host uses the BC-bus to load data and microprograms into
each cell. 


2.4 Interface Unit (IFU)

The block diagram of IFU is shown in Figure 2-2. 

The interface to host enables the host to access different 
parts of the IFU as well as the cells. 


[Figure 2-2: block diagram of the IFU, showing the interface to
host, microengine, internal data bus, address bus, data memory,
YA memory, YB memory and interface to broadcast bus]

Microengine.

The control unit of IFU is a programmable microengine with a 
sequencer and an address generator. Control instructions/signals 
for each device are put together into a wide micro-instruction in 
central microcode memory and a micro-instruction is executed in 
every system clock cycle. 

During each cycle, the sequencer monitors the conditions and 
micro-instruction to determine the next microprogram address.

The address generator generates addresses on the address bus 
which flow systolically from cell to cell. It also has to address 
the data memory in the IFU. In a single instruction, the device 
can 

- output a 16-bit memory address

- modify this memory address 

- detect when the address value has moved to or beyond
a preset boundary and conditionally loop back to the 
top of a circular buffer. 
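
The three actions listed above can be sketched as a single step function. The Python model below is only an illustration; the class and attribute names are assumptions and do not correspond to fields of the actual micro-instruction.

```python
class AddressGenerator:
    """Toy model of one address register stepping through a circular buffer."""

    def __init__(self, base, limit, step=1):
        self.base = base      # top of the circular buffer
        self.limit = limit    # preset boundary (first address outside the buffer)
        self.step = step      # amount added to the address on each instruction
        self.addr = base

    def next(self):
        """One instruction: output the address, modify it, wrap at the boundary."""
        out = self.addr & 0xFFFF          # a 16-bit memory address is output
        self.addr += self.step            # the address is modified
        if self.addr >= self.limit:       # moved to or beyond the boundary
            self.addr = self.base         # loop back to the top of the buffer
        return out
```

For a three-word buffer starting at 0x10, successive calls to `next()` yield 0x10, 0x11, 0x12 and then 0x10 again.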

Memory Structure.

The interface unit consists of four types of memories. 

X Memory.

The host can read and write into X memory. The IFU can also
write into X memory from the data memory, whenever the previously
loaded data is exhausted. A send operation on the X channel writes the
next data item of X memory into the X queue of cell 1.

YA Memory.

The host can read and write into YA memory. This memory can
be used for final or initial results. A send operation from the
IFU on the Y channel writes the next data item of YA memory into the Y
queue of cell 1 (or cell N, depending on the Y direction). When cell N
(cell 1) executes a send operation on the Y channel, the
value is automatically stored in the next YA memory location.

YB Memory. 

This is used for storage of intermediate results and can be 
accessed by the IFU and cell 1 (cell N) at the same time. The 
send and receive operations are similar to that of YA memory. 

The X, YA and YB memories are accessible only sequentially. 
The fourth memory is data RAM. 


2.5 Cell Unit

Each SASP cell is a computational unit, having its own
sequencer and program memory.

The architecture of each SASP cell is different from that of 
a conventional processor. Here, all the functional units are
spread out to achieve maximum possible parallelism and are
connected by a crossbar. The SASP cell has a wide instruction
format, in which dedicated fields control the independent
functional units.

The data path of a SASP cell is illustrated in Figure 2-3. We 
explain the salient features of the cell data path below. 

Microengine.

The structure of the cell microengine is similar to the IFU
microengine.

Data independent addresses are generated in the IFU, whereas
data dependent addresses are generated in the SASP cells. Thus,
the address generator is used as the local address generation
unit.


Addresses to local data memory and scratch pad memory are fed 
from the address cross bar, which has inputs from the address
queue and address port of the address generator. The sequencer 
and address generator data ports get the data from the data cross 
bar. Thus, the data to these data ports can be from the constant 
field of the micro-instruction or from any of the inputs to the 
data cross bar. 

Figure 2-3 Cell Data Path

In a single cycle, a SASP cell can initiate two
floating-point operations, read/write a data item from data
memory, send and receive two data items to and from its
neighbours, read three words and write two words to a register
file (or vice versa) and conditionally branch to a program
location. The machine has an orthogonal instruction set. This
orthogonal instruction set with the long instruction word format
exploits maximum parallelism at the instruction level [Ellis].

Intercell Communication.

Both the X and Y queues can receive data from the previous 
cell or from itself. For the Addr queue, no feedback is provided. 

Since SASP is a board design, reading of the local queue and 
writing to the queue of the neighbouring cell in a single cycle is 
difficult to realise in hardware. Hence, a send operation writes a
data item into a queue through a register in two cycles. This
explains the presence of the X, Y and Addr registers.

Cross bar. 

Internal data bandwidth is often the bottleneck of a systolic 
cell. In the cell, the two floating point chips can consume up to 
four data items and generate two results per cycle. Cross bar 
connects various data storage blocks supporting the high data 
processing rate. There are 5 input ports, 4 output ports and one 
bidirectional port. An output port can output data from any of 
the inputs, irrespective of the other outputs. 
The input ports are

    XI         -->  from X queue
    YI         -->  from Y queue
    constf     -->  from the constant (data) field of the
                    microcode
    mresult    -->  from the multiplier output
    alu_spout  -->  from the ALU output

The output ports are

    Ain        -->  to ALU input
    Bport      -->  to B port of the register file
    iout       -->  to the internal data bus of the
                    microengine (16-bit bus)
    yout       -->  input to the Y register


The bidirectional port is to/from the data memory. 

Data storage Units. 

The local data storage units include a data memory, a 
register file and a scratchpad memory. 

Data Memory.

The local data memory can be accessed in every clock cycle. 
It is generally used for storing data transmitted during the 
execution of a program. 

Register file. 

It contains 128 32-bit wide registers accessible from any of
its 5 ports. Two ports are input ports (A and B), two are output
ports (C and D) and one (E) is bidirectional.
The register file can be used for storing the intermediate
results and to transfer the data/results to/from the various
units.

Scratchpad memory.

The scratchpad memory can be used to hold scalars, floating
point constants and small arrays. The addition of the scratchpad
increases memory bandwidth and improves throughput for those
programs operating mainly on local data.

Computational Units (ALU and Multiplier).

Both ALU and Multiplier are 2-stage pipelined. Cells in a 
unidirectional systolic array can be viewed as stages in a 
pipeline. Thus the processor array supports pipelining at both 
the array and cell levels. This 2-stage pipelining greatly 
enhances the system throughput. 

The ALU supports addition, subtraction and different kinds of 
logical operations. 

The Multiplier initiates a new multiplication implicitly in 
every cycle with whatever data is present at its inputs. To 
execute a multiplication operation, data is given to the 
multiplier inputs and the result is taken out from the multiplier 
output after two clock cycles. 
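
The two-cycle latency of the multiplier can be pictured with a small model. The Python sketch below is an illustration only, with assumed names; it treats the multiplier as two latches that shift on every clock, so a product appears at the output two cycles after its operands were presented.

```python
class PipelinedMultiplier:
    """Toy 2-stage pipelined multiplier: a new product is started every cycle."""

    def __init__(self):
        self.stage1 = None   # product started in the previous cycle
        self.stage2 = None   # product visible at the output in this cycle

    def clock(self, a, b):
        """Advance one cycle with operands a and b at the inputs.

        Returns the value visible at the multiplier output during this
        cycle, i.e. the product of the operands presented two cycles ago
        (None while the pipeline is still filling).
        """
        out = self.stage2
        self.stage2 = self.stage1
        self.stage1 = a * b
        return out
```

Because a multiplication starts in every cycle regardless of what is at the inputs, a code generator built on such a unit has to track which cycle holds the product it actually wants.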

This description of the architecture will aid in 
understanding the actions of the code generator. 


3. MACHINE ABSTRACTION AND PROGRAMMING LANGUAGE S2


This chapter outlines the machine abstraction for systolic
arrays and the programming language S2, which supports the
abstraction.

3.1 The Machine abstraction 

The machine abstraction defines the lowest level details that 
are exposed to the user of the machine. The main goals of this 
machine abstraction are generality and efficiency. Hence, the 
model must allow the user to control the mapping of the
computation across the array. We have adopted the machine
abstraction proposed by [Lam].

The user sees the machine as an array of simple processors 
communicating through asynchronous communication primitives. We 
seek to expose the array level parallelism while hiding the
parallelism at the cell level. This is not unreasonable since 
automatic, effective problem decomposition is currently not 
possible for all computational domains, whereas compiler 
techniques are superior to hand coding when it comes to generating 
microcode for highly parallel and pipelined processors. 
3.2 The S2 language

The abstraction described above is supported by a language
called S2. S2 is a modified version of W2 [Lam], which was
developed for the Warp architecture. In this language, each cell in
the systolic array is individually programmed in a FORTRAN-like
language. Cells communicate with their left and right neighbours
via asynchronous communication primitives provided by the
language. We wish to emphasize that the user has control only
over the array level concurrency and not the system and cell level
concurrency.

S2 is a FORTRAN-like language with implicit data typing. The 
typing rules are as follows:

1. All variables that access the data memory are floating point 
operands. 

2. An operation that involves floating point operands yields
a result that is also of type floating point. 

3. Any other variable is of type integer. 

Thus, loop indices and array subscripts are of type integer. 
Subscripting is permitted only in a single dimension because of 
limited addressing capability of the SASP address generator.
Further, only addition and subtraction are the permitted 
operations to calculate the array offsets. 
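
The three typing rules are simple enough to state as two small functions. The following Python sketch is only an illustration of the rules as listed; the function names and the use of a set of data-memory names are assumptions, not part of SYSC.

```python
def variable_type(name, data_memory_names):
    """Rules 1 and 3: data-memory variables are floating point, the rest integer."""
    return "float" if name in data_memory_names else "integer"


def result_type(operand_types):
    """Rule 2: an operation with any floating point operand yields floating point."""
    return "float" if "float" in operand_types else "integer"
```

With `B` declared in data memory, `variable_type("B", {"B"})` is floating point while a loop index such as `i` is an integer, and an expression mixing the two yields a floating point result.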

The control flow constructs provided in S2 are do-enddo,
if-goto and goto.
Figure 3-1 shows an example of an S2 program that performs
matrix multiplication of two matrices of size 3x2 and 2x3.


    module MatrixMultiply (A,B,C)

    DMEM  B[2]    /* Each column of B matrix in cell data memory */
    XMEM  A[6]    /* matrix A is stored in X memory of IFU */
    YAMEM C[9]    /* Result matrix C will be stored in YA memory
                   * of IFU */

    cell()
    {
        do i = 1,3
            row = 0;
            /* each cell computes the dot product of its column and
             * same row of A */
            do j = 1,2
                receive(X,x,A);
                send(X,x);
                row = row + x * B[j];
            enddo;
            receive(Y,temp,C);
            do k = 1,2
                receive(Y,temp,C);
                send(Y,temp,C);
            enddo;
            send(Y,row,C);
        enddo
    }

Figure 3-1 An S2 program for matrix multiplication
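
As a rough functional picture of what the program in Figure 3-1 computes, the Python sketch below models only the data each cell sees and the value it produces. It deliberately ignores queue timing and the Y-channel hand-off between cells, and the function name is an assumption made for illustration.

```python
def systolic_matmul(A, B):
    """Functional model: cell c holds column c of B and computes column c of C."""
    n_rows = len(A)
    inner = len(A[0])
    n_cells = len(B[0])
    # Each cell's data memory is loaded with one column of B.
    cell_memory = [[B[j][c] for j in range(inner)] for c in range(n_cells)]
    C = [[0] * n_cells for _ in range(n_rows)]
    for i in range(n_rows):                 # outer "do i" loop
        x_stream = A[i]                     # row i of A ripples along the X channel
        for c in range(n_cells):
            row = 0
            for j, x in enumerate(x_stream):    # receive(X,x,A); send(X,x)
                row = row + x * cell_memory[c][j]
            C[i][c] = row                   # the dot product the cell sends on Y
    return C
```

For a 3x2 matrix A and a 2x3 matrix B this yields the 3x3 product, one column per cell.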


This program will be explained later.


An S2 program is a module. A module has a name, a list of
module parameters, and one or more cell programs. The module
parameters are like the formal parameters of a function - they
define the formal names of the input and output from the array.
The user specifies the memory locations of these parameters using
reserved words. The SASP machine has four kinds of data memories
and the user maps his data to these memories.

    XMEM  : X memory of IFU
    YAMEM : YA memory of IFU
    YBMEM : YB memory of IFU
    DMEM  : Data memory of cell

A cell program describes the actions of a group of one or 
more cells. Although the same program is shared by a group of 
cells, it does not necessarily mean that all the cells execute the 
same instruction at the same time, since a cell cannot start its 
computation until it has received the input from the preceding
cell. 

Four types of statements are supported inside a cell program:
the assignment, communication, conditional and iterative
statements. The assignment, conditional and iterative statements
have conventional syntax and semantics. Although iterative
constructs (S2 has only do) can be simulated using conditional
statements and the infinitely abusable goto, the sequencer of SASP
handles them in a different way for reasons that will become clear later.
Communication statements.

There are two types of communication statements: receive and
send. They are used to specify the interaction among the cells,
as well as between the IFU and the end cells of the array.

* receive (a, b, c)
  where
     a : channel name (X or Y)
     b : a local variable
     c : an external (IFU) variable - must be a module parameter

receive removes a data item from a and stores it in b. The
first cell of the array receives data directly from the IFU and
the value is explicitly specified by c; all other cells receive
the data transferred by the corresponding send operation of the
communicating cell.

* send (a, b, c)
  where
     a : channel name (X or Y)
     b : a local variable
     c : an external (IFU) variable - must be a module parameter

send sends the value of b on a. In addition, the result from
the end cell is stored into c. c is an optional parameter for
send. For example, common data sent by the IFU and propagated to
all the cells need not be stored back on the IFU.

The direction of the Y channel is set by a simple assignment
statement.

As an aside, it may be mentioned that S2 differs from W2 in
the following respects:

* explicit type declaration statements are absent in S2
* memory specification instructions are introduced in S2 to
  handle the four types of data memories in SASP.

With this description of S2, we next consider the structure 
of SYSC, which is the topic of the ensuing chapter.


4. STRUCTURE OF SYSC. 


A compiler consists of two major phases: a front end and a
back end. The front end consists of phases for lexical and
syntactic analysis, creation of a symbol table, semantic
analysis and generation of intermediate code. Since these phases
depend primarily on the source language and are largely
independent of the target machine, emphasis was given to the back
end. It is assumed that a front end translates an S2 program into
statements in an intermediate language called SIMPL-S. The back
end of SYSC takes these statements in SIMPL-S as its input and
generates the microcode for the SASP machine.


4.1 What is SIMPL-S ? 

SIMPL-S is an intermediate language in which the expressions
of S2 have been broken down into three-address statements. The
control flow of the original S2 program remains unaltered. Hence,
SIMPL-S can be considered as an intermediate language of
relatively high-level three-address statements. The syntax of
SIMPL-S is given in Appendix A.

The SIMPL-S equivalent of the matrix multiplication program
explained in the previous chapter (Figure 3-1) is given in Figure
4-1.

    do i = 1,3
        row = 0
        do j = 1,2
            receive(X,x,A)
            send(X,x)
            t1 = B[j]
            t2 = x * t1
            row = row + t2
        enddo
        receive(Y,temp,C)
        do k = 1,2
            receive(Y,temp,C)
            send(Y,temp,C)
        enddo
        send(Y,row,C)
    enddo

Figure 4-1 SIMPL-S Version of Matrix Multiplication
Program of Figure 3-1

We wish to emphasise that only the complex expression,

    row = row + B[j] * x

has been broken down into the three-address statements,

    t1 = B[j]
    t2 = t1 * x
    row = row + t2
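
The breaking down of a complex expression into three-address statements can be sketched in a few lines. The Python code below is not the SYSC front end (which the thesis assumes rather than implements); it borrows Python's own `ast` parser, assumes Python 3.9 or later for `ast.unparse`, and uses an illustrative function name and temporary-naming scheme.

```python
import ast


def to_three_address(target, expr):
    """Break `target = expr` into three-address statements (illustrative only)."""
    code = []
    counter = 0
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

    def new_temp():
        nonlocal counter
        counter += 1
        return f"t{counter}"

    def visit(node):
        # Return the name that holds the value of this sub-expression.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.Subscript):
            temp = new_temp()
            code.append(f"{temp} = {node.value.id}[{ast.unparse(node.slice)}]")
            return temp
        if isinstance(node, ast.BinOp):
            left = visit(node.left)
            right = visit(node.right)
            temp = new_temp()
            code.append(f"{temp} = {left} {ops[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    tree = ast.parse(expr, mode="eval").body
    if isinstance(tree, ast.BinOp):
        # The outermost operation assigns straight into the target.
        left = visit(tree.left)
        right = visit(tree.right)
        code.append(f"{target} = {left} {ops[type(tree.op)]} {right}")
    else:
        code.append(f"{target} = {visit(tree)}")
    return code
```

Calling `to_three_address("row", "row + B[j] * x")` reproduces the three statements shown above.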


4.2 Structure of SYSC 

The different phases of the SYSC compiler are shown in Figure
4-2.

4.2.1 The Front End

Apart from lexical and syntax analysis, this phase stores the
three-address statements of SIMPL-S as quadruples. Although the
front end is not implemented in SYSC, for the purpose of testing
the back end, a scanner is written which takes the input in
SIMPL-S syntax and outputs it in a form acceptable to the
data-flow analyser.

4.2.2 The Data Flow Analyser 

From the quadruples generated by the front end, a control 
flow graph of basic blocks is constructed. 
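
A minimal sketch of basic-block construction is given below in Python. The quadruple layout (a tuple whose first element is the operator) and the choice of leaders are assumptions made for illustration: the first quadruple, any label, and any quadruple that follows a do, enddo, goto or if-goto are taken to start a new block. The actual SYSC data structures may differ.

```python
def basic_blocks(quads):
    """Partition a list of quadruples into basic blocks (illustrative sketch)."""
    if not quads:
        return []
    leaders = {0}                              # the first quadruple is a leader
    for i, quad in enumerate(quads):
        op = quad[0]
        if op == "label":                      # a jump target starts a block
            leaders.add(i)
        if op in ("do", "enddo", "goto", "ifgoto") and i + 1 < len(quads):
            leaders.add(i + 1)                 # so does whatever follows a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(quads)]
    return [quads[s:e] for s, e in zip(starts, ends)]
```

Edges between the blocks (fall-through, loop-back and jump edges) would then be added to complete the control flow graph.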

For effective code optimization, generation and scheduling,
it is imperative that a compiler collect information about the
program as a whole and distribute it to each block in the flow
graph. Although code optimization is not implemented, the data
flow analyser is designed with this in mind. For an exhaustive
treatment of this topic, the interested reader is referred to
[Aho]. The output of the analyser is a flow graph with data flow
information.

We assure the reader that these book-keeping chores are by no
means a trivial task. Since the code generator makes extensive
use of the information collected by this phase, a significant
portion of the thesis work was devoted to a careful design of this
phase.

4.3 IFU Flow-Graph Extractor

SYSC has to generate separate code for the IFU and the cells. The
IFU flow graph extractor outputs a separate flow graph for the IFU
from the flow graph obtained from the data flow analyser. The
extractor scans the quadruples in a basic block one at a time and
constructs the corresponding IFU quadruple if necessary. The only
quadruples of interest are those which could force the IFU to
participate in communication with cell 1 (or cell N, depending on
the Y direction). All the remaining quadruples are discarded by
the extractor. Hence, the actions of the extractor can be
classified into the following four cases depending on the input
quadruple.


* receive : For every receive encountered, the extractor
            should generate a corresponding send statement
            so that at the time of execution, the IFU may
            send the necessary data to cell 1 (cell N) in
            the array.

* do      : Statement indicating the start of a loop
            (block).
            The number of receive operations in a cell
            program should match the number of send
            operations in the IFU. Hence, for a do
            statement of a block consisting of receive
            statements, the extractor generates a replica
            of the do quadruple.

* enddo   : Statement indicating the end of a loop
            (block).
            If the corresponding do statement has been
            replicated in the IFU, so should the enddo.

* larray  : Statements of the form A[i] = m
  rarray  : Statements of the form m = A[i]
            All array references indexed only by
            expressions of loop indices are considered
            data independent and are calculated in the
            IFU. The address calculation in the cell for
            these array references will be replaced by
            receive operations from the address queue.
            For the IFU, the extractor generates a
            quadruple for send(AddrQ, i, A).

The extractor need not generate a receive statement
corresponding to a send statement in cell N (cell 1), since a
send operation in cell N (cell 1) writes automatically into
the IFU data memory.

For the semantics of the program to remain valid, the sends 
and receives on the two branches of a conditional should follow 
the same sequence. 
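
The four cases handled by the extractor can be sketched as a single pass over the cell quadruples. The Python code below is an illustration under stated assumptions, not the SYSC implementation: the tuple layouts, such as `("receive", channel, local, external)` and `("rarray", array, index, target)`, are hypothetical, every receive is assumed to carry its external parameter, and an array reference is treated as data independent when its index is the index of an enclosing do loop.

```python
def extract_ifu_quads(cell_quads):
    """Derive the IFU quadruples from the cell quadruples (illustrative sketch)."""
    frames = [[]]        # one output list per open loop nesting level
    loop_indices = []    # indices of the enclosing do loops
    for quad in cell_quads:
        op = quad[0]
        if op == "do":
            frames.append([quad])              # tentatively replicate the do
            loop_indices.append(quad[1])
        elif op == "enddo":
            body = frames.pop()
            loop_indices.pop()
            if len(body) > 1:                  # loop produced IFU work: keep it
                frames[-1].extend(body + [quad])
        elif op == "receive":
            _, channel, _local, external = quad
            frames[-1].append(("send", channel, external))
        elif op in ("larray", "rarray"):
            _, array, index, _other = quad
            if index in loop_indices:          # data independent address
                frames[-1].append(("send", "AddrQ", index, array))
        # every other quadruple is discarded
    return frames[0]
```

A loop whose body yields no IFU quadruple is dropped together with its enddo, which keeps the number of IFU sends equal to the number of cell receives.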
4.4 Code Generators

The code generators translate the input flow graph into
microcode for the corresponding unit, IFU or cell. The detailed
description of code generation is the subject matter of the
next chapter.


5. THE CODE GENERATOR


The final phase of SYSC involves the code generators for the IFU
and the SASP cell. These code generators translate the flow graph
into the corresponding microcode. The microcode is a sequence of
very long instruction words, each of which consists of dedicated
fields that independently control the functional units of the SASP
machine. We first describe the cell code generator in detail.
The IFU code generator has a similar but much simpler design.

5.1 The Cell Code Generator

We presented the SASP architecture in Chapter 2. Here, we
take a closer look at the SASP architecture, since familiarity
with the target machine and its instruction set is a prerequisite
for developing any code generator.

5.1.1 Instructions of SASP 

The fields in a SASP cell instruction are -

1. Sequencer

2. Data field (constant)

3. Address generator

4. Flag i/p to sequencer

5. Scratch pad memory

6. X-queue

7. Y-queue

8. Address-queue

9. Data memory

10. Cross bar selection

11. Register file

12. ALU

13. Multiplier

Each unit of the cell is controlled by its instruction field
in this long instruction word. For a complete description of the
instruction set, the interested reader is referred to the SASP
simulator manual [Wand].

5.1.2 Data movements 

Most of the instructions generated by the compiler are used
for data transfers in the cell data path. Following are the
possible valid data movements in one machine cycle among different
units in a SASP cell.

    Source           List of all possible destinations

 1. XQ               X/Y register, ALU inputs, multiplier inputs,
                     data memory, register file input ports

 2. YQ               Y register, ALU inputs, multiplier inputs,
                     data memory, register file input ports

 3. Data memory      Y register, ALU inputs, multiplier inputs,
                     register file input ports

 4. ALUSPOUT         ALU-B input, Y register, multiplier inputs,
                     data memory, register file input ports

 5. MRESULT          Y register, ALU inputs, multiplier inputs,
                     data memory, register file input ports

 6. Register file    Y register, ALU inputs, multiplier inputs,
                     data memory, register file input ports


Each of these possible data movements is mapped into a set of
micro-operations. Throughout the rest of this chapter, we shall
refer to such data movements that are possible in one clock cycle
as single-shot transfers.
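
One way a code generator could consult the list of valid data movements is as a lookup table. The Python sketch below encodes the six sources and their destinations as read from the table in this section; the dictionary and function names are assumptions, and "register file" stands for the register file input ports.

```python
# Destinations shared by most sources in the data-movement table.
COMMON = {"Y register", "ALU inputs", "multiplier inputs", "register file"}

SINGLE_SHOT = {
    "XQ": COMMON | {"X register", "data memory"},
    "YQ": COMMON | {"data memory"},
    "data memory": set(COMMON),
    "ALUSPOUT": {"ALU-B input", "Y register", "multiplier inputs",
                 "data memory", "register file"},
    "MRESULT": COMMON | {"data memory"},
    "register file": COMMON | {"data memory"},
}


def is_single_shot(source, destination):
    """True if the move can be done in one machine cycle according to the table."""
    return destination in SINGLE_SHOT.get(source, set())
```

A move that is not a single-shot transfer would have to be split into two or more transfers through an intermediate unit such as the register file.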

5.1.3 Address Descriptors 


The symbol table entry for each name has an address
descriptor field which gives the location where that datum
resides. A data item in SASP can reside in any one of the
following possible locations.

1. Data Memory 

Variables that reside in data memory of the cell are 
specified by the user. So, at the parsing stage itself, we can 
get the addresses of these names and enter them into their 
respective address descriptors. These addresses remain the same 
throughout the program.

2. Register File 

Unlike conventional machines, the registers in SASP are 
primarily used for transferring data from the queues (X and Y) and
data memory to the ALU and the multiplier. With 128 registers in
the register file, problems like register shortage are rarely
encountered in most applications. For this reason, we do not
discuss register allocation. In fact, for all practical
purposes, an infinite register model is assumed.

3. Crossbar Input

The SASP code generator does not route a datum through the
cell data path unless it is required. The example given below
explains the strategy.

Consider the following sequence of statements : 

        ...
    m = x * 1
    send(Y,m)
        ...


If m has not been assigned any location, then it can be
stored in a register of the register file. Whenever m is used
later, we can route it to the required location through the
register file. Clearly, this is a simple, but circuitous way of
doing things resulting in a lot of redundant data movement and
increased code size. For instance, the above example involves two
single-shot transfers, one from MRESULT to the register file and
another from the register file to YQ. This can be avoided by
depositing the multiplier result at the input to the crossbar.


In general, a statement leaves its result at the crossbar 
input if its address descriptor is empty. To leave a result at 
the crossbar input, the crossbar input should be free. Otherwise, 
if it contains a live variable, the variable is stored in a 
register with its address descriptor appropriately modified. 


The list of statements which leave their results at the crossbar
input is :

    Statement                           Crossbar input

    1. receive(X,x)                     XI
    2. receive(Y,y)                     YI
    3. a = b+c                          ALUSPOUT
    4. a = b*c                          MRESULT
    5. a = m[i]                         DMOUT
    6. a = k   /* k: constant */        CONSTF


This necessitates a descriptor for each of the crossbar 
inputs, pointing to the symbol it contains. 
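
The bookkeeping implied by these descriptors can be sketched in Python. This is only an illustration under assumed data structures: the names `crossbar`, `address_of`, `leave_result` and the register labels are hypothetical and are not taken from SYSC.

```python
# Hypothetical model: one descriptor per crossbar input and one address
# descriptor per symbol. None of these names come from SYSC itself.
CROSSBAR_INPUTS = ["XI", "YI", "ALUSPOUT", "MRESULT", "DMOUT", "CONSTF"]

crossbar = {port: None for port in CROSSBAR_INPUTS}  # port -> symbol held
address_of = {}                                      # symbol -> location
emitted = []                                         # single-shot transfers
next_free_register = 0


def leave_result(symbol, port, live):
    """Leave `symbol` at crossbar input `port`, spilling a live occupant."""
    global next_free_register
    occupant = crossbar[port]
    if occupant is not None and occupant in live:
        reg = f"R{next_free_register}"
        next_free_register += 1
        emitted.append((port, reg))      # one transfer: port -> register file
        address_of[occupant] = reg       # descriptor now points at the register
    elif occupant is not None:
        address_of.pop(occupant, None)   # a dead value is simply dropped
    crossbar[port] = symbol
    address_of[symbol] = port


# A product m is left at MRESULT; a later product n evicts it while m is live.
leave_result("m", "MRESULT", live=set())
leave_result("n", "MRESULT", live={"m"})
```

In this sketch a transfer to the register file is emitted only when a live value would otherwise be overwritten, which is the saving the strategy aims for.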

5.1.4 Address calculation for data memory elements 

If the symbol is a simple variable, then its address in the
data memory is directly obtained from its address descriptor.

If the symbol is an element of an array, it can be either 
data independent or data dependent. 

For a data independent address, an instruction to read the
address from the address queue has to be generated. Also, the
read address has to be written into the address queue of the next
cell through another instruction.

A data dependent address will be calculated by the local 
address generator. 

The local address generator 

The local address generator has four sets of registers: 

* 16 address (R) registers, to output the calculated
  address

* 6 offset (O) registers, to store the value of the index

* 4 compare (C) registers, to check the bounds of the
  array

* 4 initialisation (I) registers, to store the base
  address of the array

A descriptor table is maintained by the SYSC code generator 
for each of the above register sets. 

Consider the example given below, which illustrates a data
dependent address calculation.

    do i=1,10
        receive(X,x)
        j = i - x
        m = A[j]
    enddo




Since the value of j depends on the value received through
the X channel, the address of A[j] is data dependent. The address
generation procedure for A[j] involves the following steps :

- Get a new address register, say R_i.

- Let the base address of A be in I_j.
  Generate the instruction to move the contents of I_j
  to R_i.

- Let the value of j be in O_k.
  Generate the instruction sequence for R_i = R_i + O_k.

- Output R_i.
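
The steps above can be sketched as a small Python routine. It is a minimal illustration, assuming a hypothetical `state` dictionary that stands in for the descriptor tables of the R, I and O register sets; the instruction strings are made up for readability and are not SASP mnemonics.

```python
# Hypothetical register-set model of the local address generator.
def gen_data_dependent_address(array, index_var, state):
    """Return the instructions that output the address of array[index_var]."""
    r = state["free_R"].pop(0)            # get a new address register R_i
    i = state["base_in_I"][array]         # base address of the array is in I_j
    o = state["index_in_O"][index_var]    # value of the index is in O_k
    return [
        f"R{r} := I{i}",                  # move the base address into R_i
        f"R{r} := R{r} + O{o}",           # add the data dependent offset
        f"output R{r}",                   # present the address to data memory
    ]


state = {
    "free_R": list(range(16)),            # 16 address registers
    "base_in_I": {"A": 0},                # assumed: base of A held in I0
    "index_in_O": {"j": 2},               # assumed: j held in O2
}
code = gen_data_dependent_address("A", "j", state)
```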

5.1.5 Register assignment 

As mentioned before, the large number of registers in the
SASP machine usually precludes the possibility of register
shortage. We have further optimized the use of registers by the
following strategies :

1. At the end of every basic block, registers containing
   variables that are dead are freed.

2. Unnecessary storage in the register file is avoided
   by lazy (delayed) reading of the crossbar inputs.
   This effectively uses the crossbar inputs as
   temporary storage locations.
5.1.6 Sequencer


The sequencer has a set of instructions which alter the sequence
of the program depending on the flag. It has 4 counter (C)
registers to support the do statement.


5.2 The Code Generation Algorithm 


With this background, we are now ready to tackle the heart of
the matter, code generation. The functions and procedures useful
for this algorithm are listed below. The main algorithm, per se,
is then presented.

function GET_SRC_ADDR(x)
    /* gets the address of x from its address
     * descriptor */

function GET_DST_ADDR(x)
    /* gets the address of x from its address
     * descriptor; if it is empty, it selects a
     * location for x depending on the quadruple and
     * returns the address of this location */

procedure SRC_2_DST(src, dst)
    /* generates a sequence of single-shot transfer
     * instructions to route the data from source to
     * destination */

procedure QUAD_CODE(quad)
    /* Input  :: A quadruple
     * Output :: A list of instructions */
begin

    case (type of quad) of

    send(X, x) :
        begin
            src := GET_SRC_ADDR(x);
                /* can be in X register
                 * or XI */
            SRC_2_DST(src, X_register);
            Generate an instruction to write to the XQ
            of the next cell;
        end;

    send(Y, y) :
        begin
            src := GET_SRC_ADDR(y);
                /* can be a constant, a
                 * crossbar input, a data
                 * memory/register file
                 * element */
            SRC_2_DST(src, Y_register);
            Generate an instruction to write to the YQ
            of the next cell;
        end;

    receive(X, x) :
        begin
            dst := GET_DST_ADDR(x);
                /* can be data memory or
                 * XI */
            SRC_2_DST(XI, dst);
        end;

    receive(Y, y) :
        begin
            dst := GET_DST_ADDR(y);
                /* can be data memory or
                 * YI */
            SRC_2_DST(YI, dst);
        end;

    a[i] = m :
        begin
            src := GET_SRC_ADDR(m);
                /* can be a crossbar
                 * input, a constant or a
                 * register file element */
            SRC_2_DST(src, DM);
        end;

    m = a[i] :
        begin
            dst := GET_DST_ADDR(m);
                /* can be a register file
                 * element or DMOUT */
            SRC_2_DST(DATA_MEMORY, dst);
        end;

    a = b o c :
        /* o is an ALU operation */
        begin
            src := GET_SRC_ADDR(b);
                /* can be a constant, a
                 * crossbar input, a data
                 * memory/register file
                 * element */
            SRC_2_DST(src, ALU_AINPUT);
            src := GET_SRC_ADDR(c);
            SRC_2_DST(src, ALU_BINPUT);
            generate the code for the operation o;
            dst := GET_DST_ADDR(a);
                /* can be a register file
                 * element or data memory
                 * or ALUSPOUT */
            SRC_2_DST(ALUSPOUT, dst);
        end;

    a = b * c :
        begin
            src := GET_SRC_ADDR(b);
                /* can be a constant, a
                 * crossbar input, a data
                 * memory/register file
                 * element */
            SRC_2_DST(src, MUL_AINPUT);
            src := GET_SRC_ADDR(c);
            SRC_2_DST(src, MUL_BINPUT);
            dst := GET_DST_ADDR(a);
                /* can be a register file
                 * element or data memory
                 * or MRESULT */
            SRC_2_DST(MRESULT, dst);
        end;

    do i=1,n :
        begin
            Get a counter in sequencer, say C_i;
            Generate the instruction to load C_i with
            the loop count;
            If i is used later for address calculation,
            store it in an offset register of Address
            generator;
        end;

    enddo :
        begin
            Generate the instruction to decrement the
            counter associated with the present loop;
        end;

    if cond then block_num :
        begin
            Generate the instructions for ALU to
            perform the conditional operation;
            Generate the instruction for the sequencer
            to select cond and perform the associated
            action;
        end;

    goto block_num :
            Generate the instruction for sequencer to
            jump to block_num unconditionally;

end.


THE MAIN ALGORITHM :


while there exists a block do
begin

    for all quads in the block do
        QUAD_CODE(quad);

    free the address descriptors of dead
    variables;

    store the remaining variables at the
    crossbar inputs in the register file;

    move the loop independent instructions
    outside the block;

end.
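
Every case of QUAD_CODE relies on SRC_2_DST to produce a chain of single-shot transfers. The text does not say how a route is chosen, so the Python sketch below uses breadth-first search as a stand-in for finding the fewest transfers. The connectivity table `SINGLE_SHOT` and the node names are illustrative assumptions and do not describe the real SASP data path.

```python
from collections import deque

# Illustrative connectivity only; the actual cell data path is richer.
SINGLE_SHOT = {
    "MRESULT": ["REGFILE", "YREG", "DM"],
    "ALUSPOUT": ["REGFILE", "YREG", "DM"],
    "XI": ["REGFILE", "DM"],
    "YI": ["REGFILE", "YREG", "DM"],
    "DMOUT": ["REGFILE", "YREG"],
    "REGFILE": ["ALU_A", "ALU_B", "MUL_A", "MUL_B"],
}


def src_2_dst(src, dst):
    """Return the shortest list of (from, to) single-shot transfers."""
    if src == dst:
        return []
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in SINGLE_SHOT.get(node, []):
            if nxt in parent:
                continue                  # already reached by a shorter route
            parent[nxt] = node
            if nxt == dst:
                path = []
                while parent[nxt] is not None:
                    path.append((parent[nxt], nxt))
                    nxt = parent[nxt]
                return path[::-1]         # transfers in execution order
            queue.append(nxt)
    raise ValueError(f"no route from {src} to {dst}")


route = src_2_dst("XI", "ALU_A")
```

With this table, a value at XI reaches an ALU input in two transfers via the register file, while a multiplier result reaches the Y register in one.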

The cell code generator generates the microcode for each cell
in the SASP array. The microcode of all the cells except the end
cell has an additional instruction at the end to wait for the
completion of computation on the end cell.
5.3 Code Generation For IFU

The code generation for IFU is simple since it does not
involve any crossbar or functional units (ALU and multiplier).
It just has additional data memory in the form of XMEM, YAMEM and
YBMEM.


The only extra quadruple in IFU is a quadruple of the form
send(AddrQ, i, A). This quadruple occurs in a do loop. Let us
say the index of the do loop is i. The IFU has to send the
data independent address of A[i] to cell 1 through the AddrQ.
Thus, the actions of the code generator for this quadruple are :

* Get a new address register, R_i.

* Generate the instruction to store the base address of
  array A in R_i and move it outside the present block.

* Generate the instruction to send the value in R_i to
  the address queue of cell 1.

Since the updated address has to be sent in each pass of the
do loop, the IFU code generator generates instructions to
increment the value of all data independent addresses (here A[i])
when it encounters an enddo quadruple.
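
The effect of these actions on the address queue can be traced with a few lines of Python. This is a sketch under stated assumptions: the function name, the base address 0x40 and the unit stride are hypothetical values chosen for illustration.

```python
# Hypothetical trace of the addresses the IFU writes to cell 1's address
# queue for A[i] inside "do i=1,n". Base address and stride are assumed.
def ifu_address_stream(base_address, loop_count, stride=1):
    r = base_address             # loaded once, outside the loop block
    sent = []
    for _ in range(loop_count):
        sent.append(r)           # send(AddrQ, i, A): write R to the AddrQ
        r += stride              # generated at enddo: increment the address
    return sent


addresses = ifu_address_stream(base_address=0x40, loop_count=4)
```

The base address is loaded only once because that instruction is moved outside the loop block; only the send and the increment remain inside.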

Except for this address generation issue, the IFU code
generator follows the same outline as that of the cell code
generator.


6. IMPLEMENTATION OF SYSC 


We enumerate below certain implementation related details of
the SYSC compiler.

6.1 Quadruple Constructor 

In the absence of a full-fledged front end, for purposes of
testing the back end, a simple scanner for SIMPL-S statements has
been written. A symbol table entry for an identifier has the
following structure :

* name   : name of the identifier
* type   : type of the identifier
* val    : value of the identifier, if any
* defset : set of all definitions of the variable
* useset : set of all uses of the variable
* addr   : address descriptor

The information about every SIMPL-S statement is stored in a
separate node whose contents include :

* Quad  : the associated quadruple
* Dflow : data flow information
* Instr : pointer to the set of instructions generated
          for the quad
Quadruples are constructed at the time of parsing. A
quadruple is a record structure with four fields, which we denote
by op, arg1, arg2 and result. The op field contains an internal
code for the operator. The other fields point to the symbol table
entries of the corresponding data elements.
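
A compact Python rendering of these records may help fix the idea. It is a sketch only: the original is written in C, and the class names, the `lookup` and `make_quad` helpers and the default type are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Symbol:
    name: str
    type: str = "float"                          # assumed default type
    val: Optional[float] = None
    defset: set = field(default_factory=set)     # quads that define the name
    useset: set = field(default_factory=set)     # quads that use the name
    addr: Optional[str] = None                   # address descriptor


@dataclass
class Quad:
    op: str                                      # internal operator code
    arg1: Optional[Symbol]
    arg2: Optional[Symbol]
    result: Optional[Symbol]


symtab = {}


def lookup(name):
    """Return the symbol table entry for `name`, creating it if needed."""
    return symtab.setdefault(name, Symbol(name))


def make_quad(number, op, arg1, arg2, result):
    """Build quad `number` and record its uses and definition."""
    q = Quad(op, lookup(arg1), lookup(arg2), lookup(result))
    q.arg1.useset.add(number)
    q.arg2.useset.add(number)
    q.result.defset.add(number)
    return q


q = make_quad(1, "*", "b", "c", "a")             # a = b * c
```

Because the fields hold references to symbol table entries rather than names, updating an address descriptor is immediately visible to every quadruple that mentions the symbol.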


6.2 Representation of Basic Blocks

Each basic block contains a doubly linked list of quadruples.
This representation is ideal since data flow analysis involves
both backward and forward motions over the flow graph. In
addition, the transformations on basic blocks (not implemented in
SYSC) involve the movement of quadruples, their deletion and
insertion. A basic block entry contains pointers to the leader
and trailer quadruples and a list of its predecessor and successor
basic blocks.


6.3 Data Flow Analyser 

The data flow analysis implementation adheres to the approach
taken in [Hecht]. The iterative algorithms for data flow analysis
make extensive use of set manipulation. Thus, an efficient
representation for a set would be a sequence of contiguous bytes,
where the first byte stores the size of the set and the remaining
bytes actually represent the set. The data flow information
associated with a node is :

* reaching definition set
* live variable set
* definition number and du-chain, for the result field
  of quadruple
* use-number1, ud-chain1 and next-use1, for the arg1
  field of quadruple
* use-number2, ud-chain2 and next-use2, for the arg2
  field of quadruple
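
The byte-sequence set representation can be illustrated in Python with a `bytearray`. This is a sketch under assumptions: the first byte is taken to hold the number of payload bytes, elements are small integers, and the helper names are invented here. It applies the usual live-variable equation, live_in = use | (live_out - def), to one node.

```python
def make_set(universe_size):
    """First byte holds the number of payload bytes; the rest are the bits."""
    nbytes = (universe_size + 7) // 8      # must stay below 256 in this sketch
    return bytearray([nbytes]) + bytearray(nbytes)


def add(s, element):
    s[1 + element // 8] |= 1 << (element % 8)


def contains(s, element):
    return bool(s[1 + element // 8] & (1 << (element % 8)))


def union(a, b):
    return bytearray([a[0]]) + bytearray(x | y for x, y in zip(a[1:], b[1:]))


def difference(a, b):
    return bytearray([a[0]]) + bytearray(
        x & ~y & 0xFF for x, y in zip(a[1:], b[1:])
    )


# One node, with variables numbered 0..9.
use, defs, live_out = make_set(10), make_set(10), make_set(10)
add(use, 1)
add(defs, 2)
add(live_out, 2)
add(live_out, 9)
live_in = union(use, difference(live_out, defs))
members = [v for v in range(10) if contains(live_in, v)]
```

Union and difference reduce to byte-wise logical operations, which is what makes this layout attractive for the iterative algorithms.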


6.4 Code Generator

The implementation of the code generator involved extensive
book-keeping tasks. Since much care was taken during the design
of the code generation algorithm, the implementation issues were
covered by it. An exhaustive list of data transfers in the cell
data path was mapped to the corresponding instruction sequences of
SASP. In effect, the code generation algorithm itself is
independent of the instruction set of SASP.


6.5 Testing SYSC 

The SYSC compiler was developed on a Sun-3 machine. The code
was written in C and ran into nearly 6000 lines.

To test SYSC, we selected a few algorithms and wrote programs
for them in SIMPL-S. These programs were compiled using SYSC. As
mentioned earlier, SASP is under development but the simulator is
available. The output of SYSC was the microprogram for the
simulator. The microprograms were assembled using the
meta-assembler.

Using simulator commands, we loaded the data into IFU
memories and the data memory of the cell. Then the object code
output of the meta-assembler was run on the simulator.

We describe below the S2 programs for which the above test
was done.

6.5.1 Matrix multiplication program 

The S2 program for this is given in Figure 3-1. 

Let A and B be two matrices of size MxN and NxK respectively.
Matrix A is stored in the X memory of the IFU row by row. Matrix
B is stored in the data memory of the cells, one column per cell
(we have a total of K cells). Matrix C is stored in the YB memory
of the IFU, with zeros as the initial values. The mapping of the
array elements is shown in Figure 6-1(i).

The IFU sends the X memory data on X channel sequentially. A
cell N computes a column of the result matrix using this X data on
the X channel and sends the result to the cell N+1 on Y channel.

(i) Mapping of input data to the array



At the end of computation of an element of a column, cell N
receives K elements from cell N-1 on Y channel. Out of this, it
sends K-1 elements unchanged to cell N+1 and sends the element
computed by it as the Kth element. The same action is performed
by all the cells in the array and the IFU gets one row of the
result matrix at the end of sends by cell K. Figure 6-1(ii)
illustrates this. The arrows on the left and right of the
statement indicate the receive and send operations respectively.

The receives are ahead of sends by a count of 1. The 
microcode is analysed in Chapter 7. 
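
The data mapping can be checked with a functional model in Python. This is a sketch, not a cycle-accurate simulation: the function name is hypothetical, the cells are visited one after another rather than in parallel, and the input matrices are the ones listed for this test in Appendix B.

```python
def systolic_matmul(A, B):
    """Model the mapping: cell k holds column k of B, A streams in by rows."""
    n_cells = len(B[0])
    columns = [[row[k] for row in B] for k in range(n_cells)]  # per-cell memory
    C = []
    for a_row in A:                     # the IFU streams one row of A on X
        y_stream = []                   # elements travelling on the Y channel
        for col in columns:             # the row passes through each cell
            element = sum(a * b for a, b in zip(a_row, col))
            y_stream.append(element)    # earlier results pass on, own is added
        C.append(y_stream)              # the IFU collects one row of C
    return C


C = systolic_matmul([[1, 2], [3, 4], [5, 6]], [[2, 2, 2], [2, 2, 2]])
```

Flattened row by row, the result is 6, 6, 6, 14, 14, 14, 22, 22, 22, which agrees with the YB memory contents shown in the simulator listing of Appendix B.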

6.5.2 Polynomial evaluation 


Suppose we want to evaluate the polynomial,

    P(x) = C_m x^m + C_{m-1} x^{m-1} + ... + C_1 x + C_0

By Horner's rule, the polynomial can be reformulated from a
sum of powers into an alternating sequence of multiplications and
additions :

    P(x) = ((C_m x + C_{m-1})x + ... + C_1)x + C_0

The S2 program for solving polynomials of degree 4 is given
in Figure 6-2(i). The values of x are stored in the X memory of
the IFU. The coefficients are stored in the cells, one per
cell (see Figure 6-2(ii)). The mode of computation is
straightforward. The steady state of the computation is depicted
in Figure 6-2(iii). The SYSC output for this program is given in
Appendix B.
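
The computation performed along the array is Horner's rule with one step per cell, which the following Python sketch reproduces. The function name and the initial Y value of zero are assumptions. The test polynomial and x values are those of the simulator run in Appendix B.

```python
def systolic_horner(coefficients, xs):
    """coefficients[k] is held by cell k+1, highest power first."""
    results = []
    for x in xs:                  # the IFU sends x on X and a zero on Y
        y = 0.0
        for c in coefficients:    # each cell computes Y_out = Y_in * X_in + C
            y = y * x + c
        results.append(y)         # value returned to the IFU on Y
    return results


# x^4 + 2x^3 + 3x^2 + 4x + 5 evaluated at x = 0, 1, 2, 3, 4
values = systolic_horner([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
```

The values obtained are 5, 15, 57, 179 and 453, matching the simulator output listed in Appendix B.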


DMEM C          /* coefficient stored in the cell */
YAMEM A[5]      /* values of the polynomial */
XMEM B[5]       /* different x values */

cell()
{
    /* Compute the polynomials */
    do i=1,5
        receive(X,x,B);
        receive(Y,y,A);
        partial = y * x + C;
        send(X,x);
        send(Y,partial,A);
    enddo;
}


( i ) The S2 program 


[Diagram: CELL 1, CELL 2 and CELL 3 connected in sequence by the
X and Y channels, with one coefficient stored in each cell]

(ii) Mapping of data to the array


    X_out = X_in
    Y_out = Y_in * X_in + C

(iii) Steady state of Computation




Figure 6-2 Polynomial Evaluation Program 
6.5.3 Convolution of two sequences


Let us consider the convolution of two sequences x(n) and h(n),

    Y(i) = SUM over k of x(k) * h(i-k),    1 <= i <= 2n - 1


The S2 program for finding the convolution of two sequences,
x(3) and h(3), is given in Figure 6-3(i). The input sequence x(n)
is kept in X memory and the weights h(n) are kept in the cell
array as shown in Figure 6-3(ii). The initial results are kept in
YA memory. In this case, these are all zeros.


The computation of the result is illustrated in Figure 
6-3(iii). 
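
As a reference for what the array must produce, the convolution sum can be evaluated directly in Python. This sketch uses zero-based indices and does not model the systolic schedule; the function name is hypothetical, and the inputs are the sequences listed for the convolution test in Appendix B.

```python
def convolve(x, h):
    """Direct evaluation of y(i) = sum over k of x(k) * h(i - k)."""
    y = [0] * (len(x) + len(h) - 1)     # 2n - 1 outputs for two length-n inputs
    for k, xk in enumerate(x):
        for j, hj in enumerate(h):
            y[k + j] += xk * hj         # term x(k) * h(i - k) with i = k + j
    return y


result = convolve([1, 2, 3], [2, 2, 2])
```

For these inputs the five output values are 2, 6, 12, 10 and 6, which is why y is declared with five elements in the S2 program.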


XMEM x[3]       /* the sequence x(n) */
YAMEM y[5]      /* the convolution output */
DMEM h          /* an element of the second sequence, h(n) */

cell()
{
    receive(Y,y1,y);
    do i=1,3
        receive(X,xdata,x);
        send(X,x);
        receive(Y,ydata,y);
        sum = h * xdata + ydata;
        send(Y,sum,y);
    enddo;

    receive(Y,ydata,y);
    send(Y,ydata,y);
    send(Y,y1,y);
}

(i) The S2 program

Figure 6-3 1-D Convolution Program
(ii) Mapping of data to the array

[Diagram: step-by-step trace of the receive, compute and send
operations on the X and Y channels in Cell 1, Cell 2 and Cell 3,
showing the products of h and x accumulating into partial sums as
they move along the Y channel]

(iii) Computation on the array


Figure 6-3 1-D Convolution Program (Contd.)
6.5.4 Sorting a sequence


DMEM C          /* element retained in the cell */
YAMEM A[5]      /* the unsorted sequence */

cell()
{
    C = 0;

    /* retain the largest element in the sequence */
    do i=1,5
        receive(Y,y,A);
        if C > y goto 20;
        send(Y,C,A);
        C = y;
        goto 30;
    20: send(Y,y,A);
    30: enddo;
}


Figure 6-4. S2 program for systolic sorting


We want to sort a sequence y(n) whose elements are greater
than 0 into non-decreasing order. Before computation begins, a
zero is kept in the data memory of each cell. y(n) is stored
initially in the YA memory. As it passes through the array, each
cell retains the largest element in the sequence received and
sends the remaining values to the next cell. At the end of the
computation, the ith element in the sorted sequence is found in
cell i. The S2 program to achieve this is given in Figure 6-4 and
the SYSC output is given in Appendix B.
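
The behaviour of the cell program can be reproduced with a short Python model. It is a sequential sketch of the same compare-and-forward rule, not a pipelined simulation; the function name is hypothetical, and the input is the unsorted sequence used for the simulator run in Appendix B.

```python
def systolic_sort(sequence, n_cells):
    """Each cell keeps the largest value it has seen and forwards the rest."""
    cells = [0] * n_cells               # data memory of every cell starts at 0
    for y in sequence:                  # values enter cell 1 on the Y channel
        for i in range(n_cells):
            if cells[i] > y:
                continue                # label 20: send y on unchanged
            cells[i], y = y, cells[i]   # send the old C on, then C = y
    return cells


retained = systolic_sort([3, 6, 15, 12, 9], 5)
```

After the run, cell 1 holds 15, cell 2 holds 12, cell 3 holds 9 and cell 4 holds 6, as in the data memory dumps of Appendix B.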



7. CONCLUSIONS


A systolic array compiler called SYSC has been designed and
its back end implemented. Systolic algorithms which are mapped
onto SASP can be written in a high-level language called S2. The
assumed front end translates these S2 programs into equivalent
statements in an intermediate language called SIMPL-S, from which
the back end takes its input. We conclude the discussion on SYSC
with a fair assessment of the work done and suggestions for future
work.


7.1 Benefits of SYSC 

The current style of programming on SASP is coding in
assembly language. For this purpose, the programmer has to
acquire a firm grasp over the machine details and master the
instruction set. A lot of effort is thus spent on these details
rather than concentrating on the more important issue of mapping
the algorithm onto the SASP array. With the availability of SYSC,
the programmer is relieved of the onerous task of understanding
the machine details. With only the SASP array architecture
visible to him, programming in S2 is much more intuitive.

For example, the assembly language written by a typical
programmer without SYSC's support is as shown in Figure 7-1(i).
Figure 7-1(ii) shows an S2 program for the same task.

It is seen from this figure that in the S2 program, there is 
no separate program for the IFU. SYSC generates the code for IFU 
after extracting the relevant parts from the data flow graph 
constructed by it. 

The listing of some S2 programs is given in Appendix B. 

7.2 Quality of the output 

The code generated by SYSC is efficient to a certain degree.
As explained in Chapter 5, SYSC's code uses a minimal number of
data transfers, i.e. one-shot transfers on the shortest possible
path. Thus, the code resembles that which would have been written
by an experienced microcode programmer who is ignorant of the
principles of compaction.

We give a comparison of a typical hand coded microprogram
with the output of SYSC for a corresponding S2 program. The
matrix multiplication microprogram given in Figure 7-2(i) is from

(i) Typical program written by an assembly language programmer

We observe the following differences.

* The hand coded program does not utilise the full power of
  the machine. In the above hand coded program, the data
  independent addresses are computed in each cell, which
  reduces the speed of computation. This is the case with most
  of the hand coded microprograms on SASP [Shou'nxxl. In the
  SYSC output, data independent addresses are computed in the
  IFU and propagated through the address channel to all the
  cells, reducing the computational burden on the cell.

* The code produced by SYSC is larger and takes more time to
  run. This is so because of the absence of a scheduler. With
  a scheduler, we can get code as good as hand coded code. We
  do not envisage any conceptual difficulty in the
  implementation of a scheduler for SYSC; because of a time
  crunch, we could not do it in this thesis.


7.3 Suggestions for future work 

To be sure, this compiler is not complete in all respects.
For SYSC to become completely operational, a front end and a
scheduler are mandatory. More parallelism than what SYSC extracts
can be obtained by overlapping the operations from different basic
blocks. The WARP compiler, for instance, employed the techniques
of software pipelining and hierarchical reduction for its
scheduler ([Lam]). The interested reader is also referred to
[Fisher]'s dissertation, for a comprehensive evaluation of
scheduling techniques.

W2 and S2 are high-level programming languages, which hide
the cell architecture from the user but expose the array
architecture. In these languages, the user programs each cell
individually and manages inter cell communication explicitly.
Mapping scientific programs to programs in the above languages is
a nontrivial task and an area of ongoing research. In fact, a
compiler is being developed ([Tseng]) for a high-level language
AL, in which the user views the entire systolic array as a
sequential machine and the compiler generates W2 programs with
inter cell communication. Although this compiler extracts the
parallelism from only a few programs, a similar task for SASP
would be an interesting exercise.
APPENDIX A. SYNTAX OF SIMPL-S


The syntax of SIMPL-S is given below in BNF notation.


dir_stmt       ::= ydir = LEFT
                 | ydir = RIGHT

statement      ::= labelled_stmt
                 | assign_stmt
                 | cond_stmt
                 | commn_stmt
                 | iter_stmt

labelled_stmt  ::= label statement

assign_stmt    ::= var = - arg
                 | var = arg
                 | var [ arg ] = arg
                 | var = var [ arg ]
                 | var = arg op arg
                 | var = arg rel_op arg
                 | var = ! arg
                 | var = logical_const

cond_stmt      ::= goto integer
                 | if arg rel_op arg goto integer
                 | if arg goto integer

commn_stmt     ::= send ( X , var )
                 | send ( Y , float_arg , var )
                 | receive ( X , var , var )
                 | receive ( Y , var , var )

iter_stmt      ::= do var = 1 , integer stmt_list enddo

rel_op         ::= != | < | <= | > | >= | ==

op             ::= + | - | * | /

arg            ::= var | integer | float

float_arg      ::= var | float



APPENDIX B. SYSC OUTPUT FOR TEST PROGRAMS 


1. Matrix multiplication program

/* Results for the matrix multiplication program given in
   Figure 7-2(ii) */

/* Command file for SIMS */
The input matrices are

        1 2             2 2 2
    A = 3 4         B = 2 2 2
        5 6


/* Running the Simulator */

Script started on Mon Jan 13 15:31:56 1992
$ sims

_Arch_file: syst.arh

SIM> cmdfile
_File: mat.cmd

SIM> run

Program terminated.

SIM> ybl -f
rdcntrl=0
wrcntrl=9
_Address:

0000h : 6.000000e+00
0001h : 6.000000e+00
0002h : 6.000000e+00
0003h : 1.400000e+01
0004h : 1.400000e+01
0005h : 1.400000e+01
0006h : 2.200000e+01
0007h : 2.200000e+01
0008h : 2.200000e+01

SIM> exit
_Verify: Y
$ exit

script done on Mon Jan 13 15:32:48 1992


2. 1-D Convolution problem

/* S2 program for the convolution problem */

XMEM x[3]       /* the sequence x(n) */
YAMEM y[5]      /* the convolution output */
DMEM h          /* an element of the second sequence, h(n) */

cell()
{
    receive(Y,y1,y);
    do i=1,3
        receive(X,xdata,x);
        send(X,x);
        receive(Y,ydata,y);
        sum = h * xdata + ydata;
        send(Y,sum,y);
    enddo;

    receive(Y,ydata,y);
    send(Y,ydata,y);
    send(Y,y1,y);
}



/* SYSC output for the convolution program given above */

/* Code for the cell */

block1: wrcntr(C0) & 2 8000h+3-2 & ken & xbiout_constf
        2 0 & ken & xbiout_constf & yrtr(r0) & dsel
        rdyq & xbbport_yi & reg_bwr & b_addr(#0h)
block2: rdxq & xbbport_xi & reg_bwr & b_addr(#1h)
        wrxq
        reg_crd & c_addr(#2h) & mul_aen & xbbport_dmout & reg_bwr &
            b_addr(#2h) & rddm & yrtr(r0)
        reg_crd & c_addr(#1h) & mul_ben
        cont
        cont
        moen & reg_drd & d_addr(#3h) & alu_aen & reg_awr & a_addr(#3h)
        rdyq & reg_erd & e_addr(#4h) & alu_ben & xbbport_yi & reg_bwr &
            b_addr(#4h)
        sadd
        cont
        aoen & xbyout_aluspout
        wryq
        dccntr(C0)
        jda(sign) & 2 block2 & ken & xbiout_constf
block3: rdyq & xbyout_yi
        wryq
        reg_erd & e_addr(#0h) & xbyout_aluspout & aluout_bufen &
            ain_to_out
        wryq


/* Code for the IFU */

block1: rdya
        wryq
        wrcntr(C0) & 2 8000h+3-2 & ken
block2: rdx
        wrxq
        rdya
        wryq
        dccntr(C0)
        jda(sign) & 2 block2 & ken
block3: rdya
        wryq


/* Running on the simulator */
data x[3] = 1 2 3
     h[3] = 2 2 2

Script started on Fri Jan 24 12:18:13 1992
$ sims

_Arch_file: syst.arh
        goto 30;
    20: send(Y,y,A);
    30: enddo;
}


/* SYSC output for the sorting program given above */

/* Code for the Cell */

block1: 2 0 & ken & xbiout_constf & yrtr(r0) & dsel
        2 0 & ken & xbdmin_constf & wrdm & axbdm_agi & yrtr(r0)
        wrcntr(C0) & 2 8000h+5-2 & ken & xbiout_constf
block2: reg_drd & d_addr(#0h) & alu_aen & xbbport_dmout & reg_bwr &
            b_addr(#0h) & axbdm_agi & yrtr(r0) & rddm
        rdyq & reg_erd & e_addr(#1h) & alu_ben & xbbport_yi &
            reg_bwr & b_addr(#1h)
        scomp
        cont
        jda(flag) & greaterthan & 2 block4 & ken & xbiout_constf
block3: rddm & xbyout_dmout & yrtr(r0)
        wryq
        reg_erd & e_addr(#1h) & xbdmin_aluspout & aluout_bufen &
            ain_to_out & wrdm & axbdm_agi & yrtr(r0)
        jda(unconditional) & 2 block5 & ken & xbiout_constf
block4: reg_erd & e_addr(#1h) & xbyout_aluspout & aluout_bufen &
            ain_to_out
        wryq
block5: dccntr(C0)
        jda(sign) & 2 block2 & ken & xbiout_constf

/* Code for the IFU */

block1: wrcntr(C0) & 2 8000h+5-2 & ken
block2: rdya
        wryq
block3: dccntr(C0)
        jda(sign) & 2 block2 & ken


/* Running the code on simulator */


Unsorted sequence = A = {3, 6, 15, 12, 9}
Script started on Fri Jan 24 12:51:27 1992
$ sims

_Arch_file: syst.arh

SIM> cmdfile
_File: sort.cmd

SIM> run

Program terminated.

SIM> element
Current Command cell: 5
_Number: 1

SIM> dm -f
_Address:
0000h : 15.000000e+00

SIM> element
Current Command cell: 1
_Number: 2

SIM> dm -f
_Address:
0000h : 12.000000e+00

SIM> element
Current Command cell: 2
_Number: 3

SIM> dm -f
_Address:
0000h : 9.000000e+00

SIM> element
Current Command cell: 3
_Number: 4

SIM> dm -f
_Address:
0000h : 6.000000e+00





        cont
        cont
        moen & reg_drd & d_addr(#2h) & alu_aen & reg_awr & a_addr(#2h)
        reg_erd & e_addr(#3h) & alu_ben & xbbport_dmout & reg_bwr &
            b_addr(#3h) & axbdm_agi & yrtr(r0) & rddm
        sadd
        cont
        wrxq
        wryq
        dccntr(C0)
        jda(sign) block2 & ken & xbiout_constf


/* Code for the IFU */

block1: wrcntr(C0) & 2 8000h+5-2 & ken
block2: rdx
        wrxq
        rdya
        wryq
        dccntr(C0)
        jda(sign) & ken & block2
/* Running the code on simulator */


Input x[] = 0, 1, 2, 3, 4

Polynomial = x^4 + 2x^3 + 3x^2 + 4x^1 + 5

Script started on Fri Jan 24 11:37:00 1992
$ sims

_Arch_file: syst.arh

SIM> cmdfile
_File: pol.cmd

SIM> run

Program terminated.

SIM> ybl -f
rdcntrl=0
wrcntrl=5
_Address:

0000h : 5.000000e+00
0001h : 1.500000e+01
0002h : 5.700000e+01
0003h : 1.790000e+02
0004h : 4.530000e+02

SIM> exit
_Verify: y
$ exit


script done on Fri Jan 24 11:37:28 1992


